THE LASSO, CORRELATED DESIGN, AND IMPROVED
ORACLE INEQUALITIES

By Sara van de Geer and Johannes Lederer 

ETH Zurich 

We study high-dimensional linear models and the $\ell_1$-penalized
least squares estimator, also known as the Lasso estimator. In literature,
oracle inequalities have been derived under restricted eigenvalue or
compatibility conditions. In this paper, we complement this with entropy
conditions which allow one to improve the dual norm bound, and demonstrate
how this leads to new oracle inequalities. The new oracle inequalities show
that a smaller choice for the tuning parameter and a trade-off between
$\ell_1$-norms and small compatibility constants are possible. This implies,
in particular for correlated design, improved bounds for the prediction
error of the Lasso estimator as compared to the methods based on restricted
eigenvalue or compatibility conditions only.
1. Introduction. We derive oracle inequalities for the Lasso estimator
for various designs. Results in literature are generally based on restricted
eigenvalue or compatibility conditions (see Section 3 for definitions). We refer
to [2], [4], [5], [6], [8], [10], [11]. See also [3] and the references therein. In a
sense, compatibility or restricted eigenvalue conditions and the so-called dual
norm bound we describe below belong together. In contrast, if compatibility
constants or restricted eigenvalues are very small, the design may have high
correlations, and then the dual norm bound is too rough. In this paper, we
discuss an approach that joins both situations. The work is a follow-up of
[12]. It combines results of the latter with the parallel developments in the
area based on the dual norm bound.

We consider an input space $\mathcal{X}$ and $p$ feature mappings
$\psi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \ldots, p$. We let
$(x_1, \ldots, x_n)^T \in \mathcal{X}^n$ be a given input vector, and
$Y := (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$ be an output vector, and consider
the linear model

\[ Y = \sum_{j=1}^p \psi_j \beta_j^0 + \epsilon, \]

with $\epsilon \in \mathbb{R}^n$ a noise vector, and $\beta^0 \in \mathbb{R}^p$ a
vector of unknown coefficients. Here, with some abuse of notation, $\psi_j$
denotes the vector $\psi_j = (\psi_j(x_1), \ldots, \psi_j(x_n))^T$. The design
matrix is $X := (\psi_1, \ldots, \psi_p)$ and the Gram matrix is

\[ \hat\Sigma := X^T X / n. \]

Throughout, we assume that $\sum_{i=1}^n \psi_j^2(x_i) \le n$ for all $j$.

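We write a linear function with coefficients $\beta$ as
$f_\beta := \sum_{j=1}^p \psi_j \beta_j$, $\beta \in \mathbb{R}^p$.
The Lasso estimator is

\[ \hat\beta := \arg\min_{\beta} \Big\{ \|Y - f_\beta\|_2^2 / n + \lambda \|\beta\|_1 \Big\}. \]

We denote the estimator of the regression function $f^0 := f_{\beta^0}$ by
$\hat f := f_{\hat\beta}$.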
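As a rough numerical illustration of the estimator just defined, the following sketch minimizes $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$ by cyclic coordinate descent. Coordinate descent is our choice for illustration and is not prescribed by the text; the names `soft_threshold`, `lasso_objective` and `lasso_cd` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_objective(X, Y, beta, lam):
    """Objective ||Y - X beta||_2^2 / n + lam * ||beta||_1."""
    n = X.shape[0]
    return np.sum((Y - X @ beta) ** 2) / n + lam * np.sum(np.abs(beta))


def lasso_cd(X, Y, lam, n_sweeps=500):
    """Cyclic coordinate descent for the objective above (illustrative)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0) / n      # ||psi_j||_n^2 for each column
    resid = np.array(Y, dtype=float)         # current residual Y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]       # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            # Exact minimizer in coordinate j; threshold lam/2 because the
            # squared loss is divided by n and not by 2n.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]       # put the updated coordinate back
    return beta
```

At a minimizer, $2\psi_j^T(Y - f_{\hat\beta})/n$ equals $\lambda\,\mathrm{sign}(\hat\beta_j)$ on the active coordinates and is bounded by $\lambda$ in absolute value elsewhere, which gives a simple convergence check.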

Oracle results using compatibility or restricted eigenvalue conditions are
based on the dual norm bound

\[ \sup_{\|\beta\|_1 = 1} |\epsilon^T f_\beta| / n = \max_{1 \le j \le p} |\epsilon^T \psi_j| / n. \]
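The dual norm bound can be checked numerically: $\beta \mapsto |\epsilon^T f_\beta|$ is convex, so its supremum over the $\ell_1$-sphere is attained at a signed unit vector. The sketch below compares random points of the sphere with the maximum over columns; the function names are hypothetical.

```python
import numpy as np


def noise_term(X, eps, beta):
    """|eps^T f_beta| / n with f_beta = X beta."""
    return abs(eps @ (X @ beta)) / X.shape[0]


def dual_norm_bound(X, eps):
    """max_j |eps^T psi_j| / n, the supremum over {||beta||_1 = 1}."""
    return np.max(np.abs(X.T @ eps)) / X.shape[0]
```

Evaluating `noise_term` at the unit vector of the maximizing column reproduces `dual_norm_bound` exactly.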



Let us define

\[ \|f_\beta\|_n^2 := \sum_{i=1}^n f_\beta^2(x_i) / n = \beta^T \hat\Sigma \beta. \]



The point we make in this paper is that the dual norm bound does not take
into account possible small values for $\|f_{\hat\beta} - f_{\beta^0}\|_n$. Our
results are based on bounds for

\[ \sup_{\|\beta\|_1 \le 1,\ \|f_\beta\|_n \le R} |\epsilon^T f_\beta| / n \]

as function of $R > 0$. We then apply these to $\hat\beta - \beta^0$ (or $\beta^0$
here replaced by a sparse approximation). We use an improvement of the dual
norm bound, and show in Theorem 4.1 the consequences. The main observation
here is that with highly correlated design, one can generally take the tuning
parameter $\lambda$ of much smaller order than the usual $\sqrt{\log p / n}$.
Moreover, small compatibility constants may be traded off against the
$\ell_1$-norm of the coefficients of an oracle.
2. Organization of the paper. In Section 3, we present our notation,
and the definitions of compatibility constants and restricted eigenvalues.
Section 4 contains the main result, based on a pre-assumed improvement
of the dual norm bound. In Section 5, we present a result from empirical
process theory, which shows that the improvement of the dual norm bound
used in Section 4 holds under entropy conditions on
$\mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}$.
In Section 6, we first give a geometrical interpretation of the compatibility
constant and discuss the relation with eigenvalues. The next question to
address is then how to read off the entropy conditions directly from the
design. We show that a Gram matrix with strongly decreasing eigenvalues
leads to a small entropy of $\mathcal{F}$. Alternatively, we derive an entropy
bound for $\mathcal{F}$ based on the covering number of the design $\{\psi_j\}$,
a result much in the spirit of [7]. We moreover link these covering numbers
with the correlation structure of the design. Section 7 concludes and
Section 8 contains proofs.

3. Notation and definitions. 

3.1. The compatibility constant. Let $S \subset \{1, \ldots, p\}$ be an index
set with cardinality $s$. We define for all $\beta \in \mathbb{R}^p$,

\[ \beta_{S,j} := \beta_j 1\{j \in S\}, \quad j = 1, \ldots, p, \qquad \beta_{S^c} := \beta - \beta_S. \]

Below, we present for constants $L > 0$ the compatibility constant $\phi(L, S)$
introduced in [10]. For normalized $\psi_j$ (i.e., $\|\psi_j\|_n = 1$ for all $j$),
one can view $1 - \phi^2(1, S)/2$ as an $\ell_1$-version of the canonical
correlation between the linear space spanned by the variables in $S$ on the one
hand, and the linear space of the variables in $S^c$ on the other hand. Instead
of all linear combinations with normalized $\ell_2$-norm, we now consider all
linear combinations with normalized $\ell_1$-norm of the coefficients. For a
geometric interpretation, we refer to Section 6.

Definition The compatibility constant is

\[ \phi^2(L, S) := \min\Big\{ s \|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2 : \ \|\beta_S\|_1 = 1,\ \|\beta_{S^c}\|_1 \le L \Big\}. \]

The compatibility constant is closely related to (and never smaller than) the
restricted eigenvalue as defined in [2], which is

\[ \phi_{\mathrm{RE}}^2(L, S) := \min\Big\{ \frac{\|f_{\beta_S} - f_{\beta_{S^c}}\|_n^2}{\|\beta_S\|_2^2} : \ \|\beta_{S^c}\|_1 \le L \|\beta_S\|_1 \Big\}. \]

See also [8], and see [13] for a discussion of the relation between restricted
eigenvalues and compatibility.

3.2. Projections. As the "true" $\beta^0$ is perhaps only approximately sparse,
we will consider a sparse approximation. The projection of $f^0 := f_{\beta^0}$
on the space spanned by the variables in $S$ is

\[ f_S := \arg\min_{f = f_{\beta_S}} \|f - f^0\|_n. \]

The coefficients of $f_S$ are denoted by $b^S$, i.e.,

\[ f_S = f_{b^S}. \]

Note that $f_S$ only has non-zero coefficients inside $S$, that is,
$(b^S)_S = b^S$.
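The projection $f_S$ is an ordinary least squares fit of $f^0$ on the columns indexed by $S$. A minimal sketch, assuming $f^0$ is available as a vector of its values at $x_1, \ldots, x_n$ (in practice it is unknown); `project_on_subset` is a hypothetical helper name.

```python
import numpy as np


def project_on_subset(X, f0, S):
    """Return (b_S, f_S): coefficients supported on S and the fitted vector.

    b_S minimizes ||X b - f0||_2 over all b that vanish outside S.
    """
    S = np.asarray(sorted(S), dtype=int)
    b = np.zeros(X.shape[1])
    if S.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, S], f0, rcond=None)
        b[S] = coef
    return b, X @ b
```

The residual $f^0 - f_S$ is orthogonal to every $\psi_j$ with $j \in S$, and $b^S$ is zero outside $S$ by construction.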

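4. Main result. We let $\mathcal{T}_\alpha$ be the set

\[ \mathcal{T}_\alpha := \Big\{ \sup_\beta \frac{4 |\epsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^{\alpha}} \le \lambda_0 \Big\}. \]

Here, $0 < \alpha \le 1$ and $\lambda_0 > 0$ are fixed constants.
Note that on $\mathcal{T}_\alpha$,

\[ \sup_{\|\beta\|_1 = 1,\ \|f_\beta\|_n \le R} |\epsilon^T f_\beta| / n \le \lambda_0 R^{1-\alpha} / 4, \]

i.e., we have a refinement of the dual norm bound described in Section 1.

Note that for fixed $\lambda_0$ and for $\alpha < \tilde\alpha$, it holds that
$\mathcal{T}_\alpha \subset \mathcal{T}_{\tilde\alpha}$. This is because
by the triangle inequality

\[ \Big\| \sum_j \psi_j \beta_j \Big\|_n \le \sum_j \|\psi_j\|_n |\beta_j| \le \|\beta\|_1. \]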

We want to choose $\alpha$ preferably small, yet keep the probability of the
set $\mathcal{T}_\alpha$ large. For $\alpha = 1$, one has

\[ \mathcal{T}_1 = \Big\{ \max_{1 \le j \le p} 4 |\epsilon^T \psi_j| / n \le \lambda_0 \Big\}, \]

by the dual norm bound. Thus, e.g. when $\epsilon \sim \mathcal{N}(0, I)$, the
probability $P(\mathcal{T}_1)$ of $\mathcal{T}_1$ is large when
$\lambda_0 \asymp \sqrt{\log p / n}$. We detail in Section 5 how one can
lower bound $P(\mathcal{T}_\alpha)$ for a proper value of $\alpha$ depending on
the design $\{\psi_j\}$. Generally, the value for $\lambda_0$ will be of order
$\sqrt{\log p / n}$, as in the case $\alpha = 1$, or
$\lambda_0 \asymp \sqrt{\log n / n}$ or even $\lambda_0 \asymp 1/\sqrt{n}$.
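For Gaussian noise and $\alpha = 1$, the probability of $\mathcal{T}_1$ can be estimated by simulation. The sketch below assumes $\epsilon \sim \mathcal{N}(0, I)$ and $\|\psi_j\|_n \le 1$, and takes $\lambda_0 = 4\sqrt{2\log(2p/\delta)/n}$, a standard union-bound choice of ours (not a constant from the text) under which $P(\mathcal{T}_1) \ge 1 - \delta$. The function names are hypothetical.

```python
import numpy as np


def lambda0_union_bound(n, p, delta):
    """lambda_0 of order sqrt(log p / n); an illustrative union-bound choice."""
    return 4.0 * np.sqrt(2.0 * np.log(2.0 * p / delta) / n)


def prob_T1(X, lam0, n_sim=500, seed=0):
    """Monte Carlo estimate of P(max_j 4|eps^T psi_j|/n <= lam0), eps ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    eps = rng.standard_normal((n_sim, n))           # one noise vector per row
    stat = 4.0 * np.max(np.abs(eps @ X), axis=1) / n
    return float(np.mean(stat <= lam0))
```

Because the union bound ignores correlations between the $\psi_j$, the simulated probability for a highly correlated design is typically well above $1 - \delta$, which is the slack the entropy approach of Section 5 tries to exploit.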



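The choice of the tuning parameter $\lambda$ depends on $\lambda_0$. The
following technical lemma will be used:

Lemma 4.1. Let $0 < \alpha \le 1$ and let $a$, $b$ and $\lambda_0$ be positive
numbers. Then

\[ \lambda_0 a^{1-\alpha} b^{\alpha} \le \frac{1}{2} a^2 + \lambda b + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}}. \]

Here, when $\alpha = 1$,

\[ \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} = \Big( \frac{\lambda_0}{\lambda} \Big)^{\infty} = \begin{cases} \infty, & \lambda < \lambda_0, \\ 1, & \lambda = \lambda_0, \\ 0, & \lambda > \lambda_0. \end{cases} \]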
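A quick numerical sanity check of Lemma 4.1 can be sketched as follows. The two helper functions are hypothetical names; the case $\alpha = 1$ uses the convention for $(\lambda_0/\lambda)^\infty$ stated in the lemma.

```python
import math


def lemma41_lhs(a, b, lam0, alpha):
    """Left hand side: lam0 * a^(1 - alpha) * b^alpha."""
    return lam0 * a ** (1.0 - alpha) * b ** alpha


def lemma41_rhs(a, b, lam, lam0, alpha):
    """Right hand side: a^2/2 + lam*b + (1/2) * (lam0/lam^alpha)^(2/(1-alpha))."""
    if alpha == 1.0:
        # Convention for (lam0/lam)^infinity.
        if lam > lam0:
            third = 0.0
        elif lam == lam0:
            third = 1.0
        else:
            third = math.inf
    else:
        third = (lam0 / lam ** alpha) ** (2.0 / (1.0 - alpha))
    return 0.5 * a ** 2 + lam * b + 0.5 * third
```

Looping over a grid of positive values of `a`, `b`, `lam`, `lam0` and of `alpha` in $(0, 1]$ should never produce a left hand side exceeding the right hand side.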



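In the proof of the main result, Theorem 4.1, we invoke Lemma 4.1 to handle
the "noise part" $\epsilon^T f_\beta$ with $\beta = \hat\beta - \beta^0$ (or
actually with $\beta^0$ replaced here by a sparse approximation). On
$\mathcal{T}_\alpha$, it holds that

\[ 4 |\epsilon^T f_\beta| / n \le \frac{1}{2} \|f_\beta\|_n^2 + \lambda \|\beta\|_1 + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} \]

uniformly in $\beta \in \mathbb{R}^p$. In the right hand side of this
inequality, the first term $\|f_\beta\|_n^2 / 2$ can be incorporated in the
risk and the second term $\lambda \|\beta\|_1$ will be overruled by the
penalty. Finally, the third term $(\lambda_0/\lambda^{\alpha})^{\frac{2}{1-\alpha}} / 2$
governs the choice of the tuning parameter $\lambda$.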

We now come to the main result. We formulate it for an arbitrary index set
$S$ partitioned in sets $S_1$ and $S_2$ in an arbitrary way. We will elaborate
on the choice of $S$ in Remarks 4.2 and 4.5. Corollaries 4.1 and 4.2 take for
a given $S$ some special choices for the tuning parameter $\lambda$ and for the
partition of $S$ into $S_1$ and $S_2$.

Recall that $f_S$ is the projection of $f^0 = f_{\beta^0}$ and $b^S$ are the
coefficients of $f_S$.

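Theorem 4.1. Let $S$ be an arbitrary index set, partitioned into two sets
$S_1$ and $S_2$, i.e. $S = S_1 \cup S_2$, $S_1 \cap S_2 = \emptyset$. Let $s_1$
be the cardinality of $S_1$. Let $\mathcal{T}_\alpha$ be the set

\[ \mathcal{T}_\alpha := \Big\{ \sup_\beta \frac{4 |\epsilon^T f_\beta| / n}{\|f_\beta\|_n^{1-\alpha} \|\beta\|_1^{\alpha}} \le \lambda_0 \Big\}. \]

Then on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + \frac{28}{3} \lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 7 \|f_S - f^0\|_n^2. \]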

Remark 4.1. We did not attempt to optimize the constants we provided
in Theorem 4.1.

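Remark 4.2. Given a value of the tuning parameter $\lambda$, we can now define
the estimation error $\mathcal{E}(S)$ using the variables in $S$ [the defining
display is missing in the source]. The oracle set $S_*$ is then the set which
trades off estimation error and approximation error, i.e., the set $S_*$ that
minimizes

\[ \mathcal{E}(S) + \|f_S - f^0\|_n^2. \]

Note that $S_*$ depends on $\lambda$, say $S_* = S_*(\lambda)$. The best value
for the tuning parameter $\lambda_*$ is then obtained by minimizing the sum of
$\mathcal{E}(S_*(\lambda))$, a constant multiple of
$(\lambda_0/\lambda^{\alpha})^{\frac{2}{1-\alpha}}$, and the approximation error
$\|f_{S_*(\lambda)} - f^0\|_n^2$.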



Remark 4.3. In practice, the tuning parameter $\lambda$ can be chosen by
cross-validation. As this method tries to mimic minimization of the prediction
error, it can be conjectured that one then arrives at rates at least as good as
the ones we discuss here, where the values of $\lambda$ depend on the design,
the (unknown) error distribution, and the unknown sparsity. This is however
not rigorously proven.

Remark 4.4. We have restricted ourselves to improvements of the dual
norm bound of the form given by sets $\mathcal{T}_\alpha$. The situation can be
generalized by considering sets of the form

\[ \Big\{ \sup_\beta \frac{4 |\epsilon^T f_\beta| / n}{G^{-1}\big( \|f_\beta\|_n / \|\beta\|_1 \big) \|\beta\|_1} \le \lambda_0 \Big\}, \]

where $G$ is a given increasing convex function with $G(0) = 0$.

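Corollary 4.1.

a) If we take $S_2 = \emptyset$, we have $S_1 = S$, and $s_1 = |S| =: s$. This
is a good choice when the compatibility constants are large for all subsets of
$S$. With the choice

\[ \lambda \asymp \lambda_0 \Big( \frac{s}{\phi^2(6, S)} \Big)^{-\frac{1-\alpha}{2}} \]

we get on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\Big( \lambda_0^2 \Big( \frac{s}{\phi^2(6, S)} \Big)^{\alpha} + \|f_S - f^0\|_n^2 \Big). \]

Recall that the dual norm bound has $\alpha = 1$. With
$\lambda_0 \asymp \sqrt{\log p / n}$ we then arrive at the "usual" oracle
inequality as provided by, among others, [2], [4], [5], [6], [8], [10], [11].
When $\alpha < 1$, the compatibility constant may be very small, as the design
is highly correlated. The effect is however somewhat tempered by the power
$\alpha$ in the bound.

b) More generally, let [display missing in the source]. Then on
$\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\Big( \lambda_0^2 \Big( \frac{s_1}{\phi^2(6, S_1)} \Big)^{\alpha} + \|f_S - f^0\|_n^2 \Big). \]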

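Corollary 4.2.

a) With the choice $S_1 = \emptyset$, the result does not involve the
compatibility constant. This may be desirable when the design is highly
correlated. The result then corresponds to what is sometimes called "slow
rates", although we will see that when $\alpha < 1$, the rates can still be
much faster than $1/\sqrt{n}$. When $\alpha = 1$, we must take
$\lambda > \lambda_0$ (due to the term
$(\lambda_0/\lambda^{\alpha})^{\frac{2}{1-\alpha}}$). When $\alpha < 1$, we choose

\[ \lambda \asymp \lambda_0^{\frac{2}{1+\alpha}} \|b^S\|_1^{-\frac{1-\alpha}{1+\alpha}}. \]

We get on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\Big( \lambda_0^{\frac{2}{1+\alpha}} \|b^S\|_1^{\frac{2\alpha}{1+\alpha}} + \|f_S - f^0\|_n^2 \Big). \]

b) More generally, let

\[ \lambda \asymp \lambda_0^{\frac{2}{1+\alpha}} \|(b^S)_{S_2}\|_1^{-\frac{1-\alpha}{1+\alpha}}. \]

Then on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\Big( \lambda_0^{\frac{4}{1+\alpha}} \|(b^S)_{S_2}\|_1^{-\frac{2(1-\alpha)}{1+\alpha}} \frac{s_1}{\phi^2(6, S_1)} + \lambda_0^{\frac{2}{1+\alpha}} \|(b^S)_{S_2}\|_1^{\frac{2\alpha}{1+\alpha}} + \|f_S - f^0\|_n^2 \Big). \]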
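The choice of $\lambda$ in part a) for $\alpha < 1$ balances the penalty term $\lambda\|b^S\|_1$ against the noise term $(\lambda_0/\lambda^\alpha)^{2/(1-\alpha)}$ of Theorem 4.1. A small sketch of that balance, with the unspecified constant in "$\asymp$" set to one (our assumption) and hypothetical function names:

```python
def slow_rate_lambda(lam0, alpha, b_l1):
    """lam = lam0^(2/(1+alpha)) * ||b||_1^(-(1-alpha)/(1+alpha)), for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this choice is meant for 0 < alpha < 1")
    return lam0 ** (2.0 / (1.0 + alpha)) * b_l1 ** (-(1.0 - alpha) / (1.0 + alpha))


def slow_rate_terms(lam0, alpha, b_l1):
    """Return (lam, penalty term, noise term) at the balancing choice."""
    lam = slow_rate_lambda(lam0, alpha, b_l1)
    penalty_term = lam * b_l1
    noise_term = (lam0 / lam ** alpha) ** (2.0 / (1.0 - alpha))
    return lam, penalty_term, noise_term
```

Both terms then coincide and equal $\lambda_0^{2/(1+\alpha)}\|b^S\|_1^{2\alpha/(1+\alpha)}$, the rate displayed in part a). Part b) uses the same formula with $\|(b^S)_{S_2}\|_1$ in place of $\|b^S\|_1$.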



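Remark 4.5. Note that by taking $S_1$ smaller, the value of
$s_1/\phi^2(6, S_1)$ will not increase, but on the other hand, the value of
$\|(b^S)_{S_2}\|_1$ will become larger. Thus, the best rate will emerge if we
trade off these two effects. Indeed, suppose that for some $S_1$

\[ \frac{s_1}{\phi^2(6, S_1)} \asymp \Big( \frac{\|(b^S)_{S_2}\|_1}{\lambda_0} \Big)^{\frac{2}{1+\alpha}}. \]

Then on $\mathcal{T}_\alpha$, for

\[ \lambda \asymp \lambda_0 \Big( \frac{s_1}{\phi^2(6, S_1)} \Big)^{-\frac{1-\alpha}{2}}, \]

we have

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 = O\Big( \lambda_0^2 \Big( \frac{s_1}{\phi^2(6, S_1)} \Big)^{\alpha} + \|f_S - f^0\|_n^2 \Big) = O\Big( \lambda_0^{\frac{2}{1+\alpha}} \|(b^S)_{S_2}\|_1^{\frac{2\alpha}{1+\alpha}} + \|f_S - f^0\|_n^2 \Big). \]

In particular for the case $\alpha < 1$, it is however not clear when such a
trade-off is possible. It may well be that for any $S_1$, $s_1/\phi^2(6, S_1)$
either heavily dominates or is heavily dominated by the $\ell_1$-part
$\|(b^S)_{S_2}\|_1$. See Section 6 for a further discussion.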



5. Improving the dual norm bound. In this section, we provide probability
bounds for the set $\mathcal{T}_\alpha$ introduced in Section 4. The results
follow from empirical process theory, see e.g. [14] and [15]. Theorem 5.1 is
taken from [3].

Definition Let $\mathcal{F}$ be a class of real-valued functions on
$\mathcal{X}$. Endow $\mathcal{F}$ with norm $\|\cdot\|_n$. Let $\delta > 0$ be
some radius. A $\delta$-packing set is a set of functions in $\mathcal{F}$ that
are each at least $\delta$ apart. A $\delta$-covering set is a set of functions
$\{\phi_1, \ldots, \phi_N\}$, such that

\[ \sup_{f \in \mathcal{F}} \min_{k = 1, \ldots, N} \|f - \phi_k\|_n \le \delta. \]

The $\delta$-covering number $N(\delta, \mathcal{F}, \|\cdot\|_n)$ of
$\mathcal{F}$ is the minimum size of a $\delta$-covering set. The entropy of
$\mathcal{F}$ is $H(\cdot, \mathcal{F}, \|\cdot\|_n) = \log N(\cdot, \mathcal{F}, \|\cdot\|_n)$.

It is easy to see that $N(\delta, \mathcal{F}, \|\cdot\|_n)$ can be bounded by
the size of a maximal $\delta$-packing set.

We assume the errors are sub-Gaussian, that is, for some positive constants
$K$ and $\sigma_0$,

\[ K^2 \big( E \exp[\epsilon_i^2 / K^2] - 1 \big) \le \sigma_0^2, \quad i = 1, \ldots, n. \tag{5.1} \]

The following theorem is Corollary 14.6 in [3]. It is in the spirit of a
weighted concentration inequality, and uses the notation

\[ x_+ := \max\{x, 0\}. \]

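Theorem 5.1. Assume (5.1). Let $\mathcal{F}$ be a class of functions with
$\|f\|_n \le 1$ for all $f \in \mathcal{F}$, and with, for some
$0 < \alpha < 1$ and some constant $A$,

\[ \log\big( 1 + 2 N(\delta, \mathcal{F}, \|\cdot\|_n) \big) \le \Big( \frac{A}{\delta} \Big)^{2\alpha}, \quad 0 < \delta \le 1. \]

Define

\[ B := \exp\Big[ \frac{A^{2\alpha}}{2 (2^{1-\alpha} - 1)^2} \Big] \]

and

\[ K_0 := 3 \times 2^5 \sqrt{K^2 + \sigma_0^2}. \]

It holds that

\[ E \exp\Big[ \sup_{f \in \mathcal{F}} \Big( \frac{|\epsilon^T f| / \sqrt{n}}{K_0 \|f\|_n} - \frac{A^{\alpha} \|f\|_n^{-\alpha}}{2^{1-\alpha} - 1} \Big)_+^2 \Big] \le 1 + 2/B. \]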



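Corollary 5.1. Assume the conditions of Theorem 5.1. Chebyshev's
inequality shows that for all $t > 0$,

\[ P\Big( \exists\, f \in \mathcal{F} : \ |\epsilon^T f| / \sqrt{n} \ge K_0 A^{\alpha} \|f\|_n^{1-\alpha} (2^{1-\alpha} - 1)^{-1} + K_0 \|f\|_n t \Big) \le \exp[-t^2] (1 + 2/B). \]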
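Corollary 5.2. Consider now linear functions

\[ f_\beta = \sum_{j=1}^p \psi_j \beta_j, \]

where $\|\psi_j\|_n \le 1$. Then

\[ \|f_\beta\|_n \le \sum_{j=1}^p |\beta_j| \|\psi_j\|_n \le \|\beta\|_1. \]

Hence, $\{f_\beta / \|\beta\|_1 : \beta \in \mathbb{R}^p\}$ is a class of
functions with $\|\cdot\|_n$-norm bounded by 1. Suppose now

\[ \log\big( 1 + 2 N(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n) \big) \le \Big( \frac{A}{\delta} \Big)^{2\alpha}, \quad 0 < \delta \le 1. \]

Under the sub-Gaussianity condition (5.1), we then have for all $t > 0$ and
for

\[ \lambda_0 := \frac{4 K_0}{\sqrt{n}} \Big( \frac{A^{\alpha}}{2^{1-\alpha} - 1} + t \Big), \]

the lower bound

\[ P(\mathcal{T}_\alpha) \ge 1 - \exp[-t^2] (1 + 2/B). \]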
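To get a feeling for the size of $\lambda_0$ in Corollary 5.2, the following sketch evaluates $\lambda_0 = (4K_0/\sqrt{n})\,(A^\alpha/(2^{1-\alpha}-1) + t)$ and the bound $1 - \exp[-t^2](1 + 2/B)$, as we read them from the corollary. The constants $K_0$ and $B$ of Theorem 5.1 are passed in as inputs rather than computed; the function names are hypothetical.

```python
import math


def lambda0_from_entropy(K0, A, alpha, t, n):
    """lambda_0 = 4*K0/sqrt(n) * (A^alpha / (2^(1-alpha) - 1) + t), 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 4.0 * K0 / math.sqrt(n) * (A ** alpha / (2.0 ** (1.0 - alpha) - 1.0) + t)


def prob_lower_bound(t, B):
    """Lower bound 1 - exp(-t^2) * (1 + 2/B) on P(T_alpha)."""
    return 1.0 - math.exp(-t ** 2) * (1.0 + 2.0 / B)
```

For fixed $K_0$, $A$, $\alpha$ and $t$ the value of $\lambda_0$ decays like $1/\sqrt{n}$ and does not involve $p$, in contrast with the $\sqrt{\log p/n}$ obtained from the dual norm bound.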

6. Compatibility, eigenvalues, entropy and correlations. We study
the set

\[ \mathcal{F} := \{f_\beta : \|\beta\|_1 = 1\}. \]

It is considered as subset of $L_2(Q_n)$, where
$Q_n := \sum_{i=1}^n \delta_{x_i} / n$. The $L_2(Q_n)$-norm is $\|\cdot\|_n$.

6.1. Geometric interpretation of the compatibility constant. We first look
at the minimal $\ell_1$-eigenvalue

\[ \Lambda_{\min,1}^2(S) := \min\big\{ s \beta_S^T \hat\Sigma \beta_S : \ \|\beta_S\|_1 = 1 \big\} \]

as introduced in [3]. Note that $\Lambda_{\min,1}(S) / \sqrt{s}$ is the minimal
distance between any point $f_{\beta_S}$ with $\|\beta_S\|_1 = 1$ and the point
$\{0\}$. We tacitly assume that the $\{\psi_j\}_{j \in S}$ are linearly
independent. The set $\{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ is then an
$\ell_1$-version of a sphere: it is the boundary of the convex hull of
$\{\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$ in $s$-dimensional space with
$\{0\}$ in its "center". It is a parallelogram when $s = 2$ (see Figure 1) and
then a rectangle when the $\psi_j$, $j \in S$, have equal length.
Fig 1. Left panel: the set $A = \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$. Right panel: $\ell_1$- and $\ell_2$-eigenvalues.



Let $\hat\Sigma_S$ be the Gram matrix of the variables in $S$ and
$\Lambda_{\min}^2(S)$ be the minimal ($\ell_2$-)eigenvalue of the matrix
$\hat\Sigma_S$:

\[ \Lambda_{\min}^2(S) := \min\big\{ \beta_S^T \hat\Sigma \beta_S : \ \|\beta_S\|_2 = 1 \big\}. \]

Then

\[ \Lambda_{\min,1}^2(S) \ge \Lambda_{\min}^2(S) \ge \Lambda_{\min,1}^2(S) / s. \]

One can construct examples where $\Lambda_{\min}^2(S)$ is as small as
$3/(s-2)$ ($s > 2$) and $\Lambda_{\min,1}^2(S)$ is at least $1/2$ (see [13]),
that is, they can differ by the maximal amount $s$ in order of magnitude. See
also Figure 1 which is to be understood as representing a case $s > 2$. Thus,
minimal $\ell_1$-eigenvalues can be much larger than minimal
($\ell_2$-)eigenvalues. The normalized compatibility constant
$\phi(L, S)/\sqrt{s}$ is the minimal distance between the sets
$A := \{f_{\beta_S} : \|\beta_S\|_1 = 1\}$ and
$B := \{f_{\beta_{S^c}} : \|\beta_{S^c}\|_1 \le L\}$, that is,

\[ \frac{\phi(L, S)}{\sqrt{s}} = \min_{a \in A,\ b \in B} \|a - b\|_n. \]

See Figure 2 for an impression of the situation. Observe that $A$ is the
boundary of the convex hull of
$\{+\psi_j\}_{j \in S} \cup \{-\psi_j\}_{j \in S}$, and $B$ is the convex hull
of $\{+\psi_j\}_{j \in S^c} \cup \{-\psi_j\}_{j \in S^c}$ including its
interior, blown up with a factor $L$ (typically, the
$\{\psi_j\}_{j \in S^c}$ form a linearly dependent system in $\mathbb{R}^n$).
Furthermore, since $\{0\} \in B$,

\[ \phi(L, S) \le \Lambda_{\min,1}(S). \]

This shows that when $\ell_1$-eigenvalues are small, the compatibility constant
is necessarily also small. Small $\ell_2$-eigenvalues may have less of this
effect.

Fig 2. The compatibility constant.

6.2. Eigenvalues and entropy. We now let

\[ \hat\Sigma = E \Omega^2 E^T \]

be the spectral decomposition of the Gram matrix $\hat\Sigma$, $E$ being the
matrix of eigenvectors ($E^T E = E E^T = I$) and
$\Omega^2 = \mathrm{diag}(\omega_1^2, \ldots, \omega_p^2)$ the matrix of
($\ell_2$-)eigenvalues. We assume they are in decreasing order:
$\omega_1^2 \ge \cdots \ge \omega_p^2$.

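Lemma 6.1. Suppose that for some strictly decreasing function $V$

\[ \omega_{j+1} \le V(j), \quad j = 1, \ldots, p. \]

Then for all $\delta > 0$,

\[ H(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n) \le V^{-1}(\delta) \log\Big( \frac{3}{\delta} \Big). \]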

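Example 6.1. Suppose that for some positive constants $m$ and $C$

\[ V(j) = \frac{C}{j^m}. \]

Then by Lemma 6.1,

\[ H(2\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n) \le \Big( \frac{C}{\delta} \Big)^{\frac{1}{m}} \log\Big( \frac{3}{\delta} \Big). \]

For $\delta > 1/n$ (say) we therefore have

\[ H(\delta, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n) \le \Big( \frac{2C}{\delta} \Big)^{\frac{1}{m}} \log(6n). \]

When $m > 1/2$, one can use a minor generalization of Corollary 5.2, where
the entropy bound is only required for values of $\delta > 1/n$. One then takes

\[ \alpha = \frac{1}{2m}, \quad A = (C_m \log(n))^m, \]

where $C_m$ is a constant depending on $m$ and $C$. Then the value of
$\lambda_0$ defined there becomes

\[ \lambda_0 = \frac{4 K_0}{\sqrt{n}} \Big( \frac{(C_m \log n)^{\frac{1}{2}}}{2^{1 - \frac{1}{2m}} - 1} + t \Big), \]

which is for fixed $m$ and $K_0$, and a fixed (large) $t$, of order
$\sqrt{\log n / n}$.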
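The quantities in this subsection are easy to compute for a concrete design. The sketch below checks the identity $\|f_\beta\|_n^2 = \sum_j \omega_j^2 (E^T\beta)_j^2$, which follows from the spectral decomposition, and evaluates $(C/\delta)^{1/m}\log(3/\delta)$, our reading of the Example 6.1 bound on $H(2\delta, \cdot)$. The function names are hypothetical.

```python
import numpy as np


def spectral_sq_norm(X, beta):
    """||f_beta||_n^2 computed through the eigen-decomposition of X^T X / n."""
    n = X.shape[0]
    gram = X.T @ X / n
    omega2, E = np.linalg.eigh(gram)        # eigh returns ascending eigenvalues
    omega2, E = omega2[::-1], E[:, ::-1]    # reorder to decreasing, as in the text
    coords = E.T @ beta                     # coordinates of beta in the eigenbasis
    return float(np.sum(omega2 * coords ** 2))


def entropy_bound_poly(delta, C, m):
    """(C/delta)^(1/m) * log(3/delta), assuming V(j) = C / j^m."""
    return (C / delta) ** (1.0 / m) * np.log(3.0 / delta)
```

Larger $m$, that is faster eigenvalue decay, makes `entropy_bound_poly` grow more slowly as $\delta \to 0$, which corresponds to a smaller $\alpha = 1/(2m)$.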



6.3. Entropy based on coverings of $\{\psi_j\}$. We can consider
$\{f_\beta : \|\beta\|_1 = 1\}$ as a subset of

\[ \mathrm{conv}(\{\pm\psi_j\}), \]

where $\{\pm\psi_j\} := \{\psi_j\} \cup \{-\psi_j\}$, and
$\mathrm{conv}(\{\pm\psi_j\})$ is its convex hull. In fact, if the
$\{\psi_j\}$ form a linearly dependent system in $\mathbb{R}^n$, $\mathcal{F}$
is exactly equal to $\mathrm{conv}(\{\pm\psi_j\})$.

The paper [7] gives a bound for the entropy of a convex hull for the case
where the $u$-covering number of the extreme points is a polynomial in $1/u$.
This result can also be found in [9]. There is a redundant log-term in these
entropy bounds, see [1] and [15], but removing this log-term may result in
very large constants, depending on the dimension $W$ as given in Example 6.2
(see [3] for some explicit constants). This means that when the dimension
$W$ of the extreme points is large (growing with $n$ say), the simple bound
with log-term we provide below in Lemma 6.2 may be better than the more
involved ones.

We give a bound for the entropy of $\mathcal{F}$ by balancing the $u$-covering
number of $\{\psi_j\}$ and the squared radius $u^2$. The result is as in [9],
the only new element being its extension to general covering numbers (i.e.,
not only polynomial ones). Lemma 6.2 and its proof can be found in [3].

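Lemma 6.2. Let

\[ N(u) := N(u, \{\psi_j\}, \|\cdot\|_n), \quad u > 0. \]

We have [the displayed bound on
$H(\cdot, \{f_\beta : \|\beta\|_1 = 1\}, \|\cdot\|_n)$ in terms of $N(u)$ is
not legible in the source].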

Example 6.2. In this example, we assume the $u$-covering numbers of
$\{\psi_j\}$ are bounded by a polynomial in $1/u$. That is, we suppose that
for some positive constants $W$ and $C$,

\[ N(u, \{\psi_j\}, \|\cdot\|_n) \le \Big( \frac{C}{u} \Big)^W, \quad u > 0. \]

The constant $W$ can be thought of as the dimension of $\{\psi_j\}$. By
Lemma 6.2, we can choose

\[ \alpha = \frac{W}{2 + W} \]

[the accompanying choice of $A$ is not legible in the source], and we get, as
in Example 6.1, a value of $\lambda_0$ in which the factor
$2^{1-\alpha} - 1$ equals $2^{\frac{2}{2+W}} - 1$.



A refined analysis of the relation between compatibility constants, covering
numbers and entropy is still to be carried out. We confine ourselves here to
the following, rather trivial, observation (without proof).

Lemma 6.3. Consider normalized design: $\|\psi_j\|_n = 1$ $\forall\, j$. Let
$\{\psi_{j_1}, \ldots, \psi_{j_N}\}$ be a maximal $u$-packing set of
$\{\psi_j\}$. Then for any $S \supset \{j_1, \ldots, j_N\}$,
$S \ne \{1, \ldots, p\}$, and any $L \ge 1$,

\[ \phi^2(L, S) \le s u^2. \]

One may argue that as $u$-packing sets are approximations of the original
design $\{\psi_j\}$ with fewer covariables, they are good candidates for the
sparsity set $S_1$ used in Theorem 4.1. Lemma 6.3 however shows that such
sparsity sets will have very small compatibility constants.

6.4. Decorrelation numbers. Decorrelation numbers are closely related to
packing numbers. First, define the inner product

\[ \rho(\phi, \tilde\phi) := \phi^T \tilde\phi / n. \]

Note that $\hat\Sigma_{j,k} = \rho(\psi_j, \psi_k)$ and that in the case of
standardized design (i.e. $\sum_{i=1}^n \psi_j(x_i) = 0$ and
$\|\psi_j\|_n = 1$ $\forall\, j$), the inner product $\rho(\psi_j, \psi_k)$ is
for $j \ne k$ the (empirical) correlation between $\psi_j$ and $\psi_k$.

Definition For $\rho > 0$, the $\rho$-decorrelation number $M(\rho)$ is the
largest value of $M$ such that there exists
$\{\phi_1, \ldots, \phi_M\} \subset \{\pm\psi_j\}$ with
$|\rho(\phi_j, \phi_k)| < \rho$ for all $j \ne k$.
Hence, if the $\rho$-decorrelation number is small, then there are many large
correlations, i.e., then the design is highly correlated.

It is clear that when $\|\psi_j\|_n = \|\psi_k\|_n = 1$, it holds that

\[ \|\psi_j - \psi_k\|_n^2 = 2 \big( 1 - \rho(\psi_j, \psi_k) \big). \]

In other words, large correlations correspond to covariables that are near to
each other. This can be translated into covering numbers as shown in Lemma
6.4. Its proof is straightforward and omitted.

Lemma 6.4. Consider normalized design: $\|\psi_j\|_n = 1$ $\forall\, j$. For
all $0 < u < 1$,

\[ N(\sqrt{2} u, \{\pm\psi_j\}, \|\cdot\|_n) \le M(1 - u^2). \]
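The identity $\|\psi_j - \psi_k\|_n^2 = 2(1 - \rho(\psi_j, \psi_k))$ and the notion of a decorrelated subset can be explored numerically. Computing $M(\rho)$ exactly is a combinatorial problem, so the sketch below only builds a decorrelated subset greedily, which yields a lower bound on $M(\rho)$. It assumes normalized columns and a level of at most 1, so that $\psi_j$ and $-\psi_j$ can never both be selected and it suffices to scan the columns themselves. The function names are hypothetical.

```python
import numpy as np


def rho(phi, phi_tilde):
    """Inner product rho(phi, phi_tilde) = phi^T phi_tilde / n."""
    return float(phi @ phi_tilde) / len(phi)


def greedy_decorrelated(X, level):
    """Greedily pick column indices whose pairwise |rho| stays below `level`.

    The length of the returned list is a lower bound on the decorrelation
    number at this level; it need not be the maximum.
    """
    chosen = []
    for j in range(X.shape[1]):
        if all(abs(rho(X[:, j], X[:, k])) < level for k in chosen):
            chosen.append(j)
    return chosen
```

For a design with many near-duplicate columns the greedy subset stays small even for levels close to 1, which is the situation described as highly correlated in the text.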

7. Conclusion. We have combined results for the prediction error of the
Lasso with both compatibility conditions and entropy conditions. Small
entropies of $\{f_\beta : \|\beta\|_1 = 1\}$ correspond to highly correlated
design and possibly to small compatibility constants. Our analysis shows that
small entropies allow for a smaller choice of the tuning parameter and possibly
for a compensation of small compatibility constants. This means that the Lasso
enjoys good prediction error properties, even in the case where the design is
highly correlated.

8. Proofs. 

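Proof of Lemma 4.1. We use that for positive $u$ and $v$ and for $p > 1$,
$q > 1$, $1/p + 1/q = 1$, the conjugate inequality

\[ uv \le u^p/p + v^q/q \]

holds. Taking $p = 2/(1 - \alpha)$ and replacing $u$ by $u^{1-\alpha}$ gives

\[ u^{1-\alpha} v \le \frac{1-\alpha}{2} u^2 + \frac{1+\alpha}{2} v^{\frac{2}{1+\alpha}}. \]

With $p = (1 + \alpha)/(2\alpha)$, and replacing $u$ by
$u^{\frac{2\alpha}{1+\alpha}}$, we get

\[ u^{\frac{2\alpha}{1+\alpha}} v \le \frac{2\alpha}{1+\alpha} u + \frac{1-\alpha}{1+\alpha} v^{\frac{1+\alpha}{1-\alpha}}. \]

Thus,

\[ \lambda_0 a^{1-\alpha} b^{\alpha} \le \frac{1-\alpha}{2} a^2 + \frac{1+\alpha}{2} \big( \lambda_0 b^{\alpha} \big)^{\frac{2}{1+\alpha}} \]
\[ \le \frac{a^2}{2} + \frac{1+\alpha}{2} (\lambda b)^{\frac{2\alpha}{1+\alpha}} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1+\alpha}} \]
\[ \le \frac{a^2}{2} + \frac{1+\alpha}{2} \Big( \frac{2\alpha}{1+\alpha} \lambda b + \frac{1-\alpha}{1+\alpha} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} \Big) \]
\[ \le \frac{a^2}{2} + \lambda b + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}}. \]

□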



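Proof of Theorem 4.1. Since

\[ \|Y - \hat f\|_2^2 / n + \lambda \|\hat\beta\|_1 \le \|Y - f_S\|_2^2 / n + \lambda \|b^S\|_1, \]

we have the Basic Inequality

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le 2 \epsilon^T (\hat f - f_S) / n + \lambda \|b^S\|_1 + \|f_S - f^0\|_n^2. \]

Hence, on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le \lambda_0 \|\hat f - f_S\|_n^{1-\alpha} \|\hat\beta - b^S\|_1^{\alpha} / 2 + \lambda \|b^S\|_1 + \|f_S - f^0\|_n^2. \]

Apply Lemma 4.1 to find

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta\|_1 \le \frac{1}{4} \|\hat f - f_S\|_n^2 + \frac{1}{2} \lambda \|\hat\beta - b^S\|_1 + \frac{1}{4} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + \lambda \|b^S\|_1 + \|f_S - f^0\|_n^2. \]

Thus, using $\|\hat f - f_S\|_n^2 \le 2 \|\hat f - f^0\|_n^2 + 2 \|f_S - f^0\|_n^2$,
we get on $\mathcal{T}_\alpha$,

\[ \|\hat f - f^0\|_n^2 + 2 \lambda \|\hat\beta\|_1 \le \lambda \|\hat\beta - b^S\|_1 + 2 \lambda \|b^S\|_1 + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 3 \|f_S - f^0\|_n^2. \]

Defining $S_3 := S^c$, we rewrite this to

\[ \|\hat f - f^0\|_n^2 + 2 \lambda \|\hat\beta_{S_2 \cup S_3}\|_1 \]
\[ \le \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + \lambda \|\hat\beta_{S_2} - (b^S)_{S_2}\|_1 + \lambda \|\hat\beta_{S_3}\|_1 + 2 \lambda \|(b^S)_{S_1}\|_1 - 2 \lambda \|\hat\beta_{S_1}\|_1 \]
\[ \quad + 2 \lambda \|(b^S)_{S_2}\|_1 + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 3 \|f_S - f^0\|_n^2 \]
\[ \le 3 \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + \lambda \|\hat\beta_{S_2 \cup S_3}\|_1 + 3 \lambda \|(b^S)_{S_2}\|_1 + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 3 \|f_S - f^0\|_n^2. \]

Moving the term $\lambda \|\hat\beta_{S_2 \cup S_3}\|_1$ to the left hand side,
and applying a triangle inequality, we obtain

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta_{S_2 \cup S_3} - (b^S)_{S_2}\|_1 \le 3 \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1 + 4 \lambda \|(b^S)_{S_2}\|_1 + \frac{1}{2} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 3 \|f_S - f^0\|_n^2 =: I + II, \]

where $I := 3 \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1$ and $II$ collects
the remaining three terms.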



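Case i. If $I \ge II$, we arrive at

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta_{S_2 \cup S_3} - (b^S)_{S_2}\|_1 \le 6 \lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1. \]

We first add a term $\lambda \|\hat\beta_{S_1} - (b^S)_{S_1}\|_1$ to the left
and right hand side and then apply the compatibility condition to
$\hat\beta - b^S$, to get

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{7 \lambda \sqrt{s_1}}{\phi(6, S_1)} \|\hat f - f_S\|_n \]
\[ \le \frac{1}{2} \|\hat f - f^0\|_n^2 + \frac{7}{2} \|f_S - f^0\|_n^2 + \frac{56 \lambda^2 s_1}{2 \phi^2(6, S_1)}. \]

Here we used the decoupling device

\[ 2xy \le b x^2 + y^2/b \quad \forall\ x, y \in \mathbb{R},\ b > 0. \]

So then

\[ \|\hat f - f^0\|_n^2 + 2 \lambda \|\hat\beta - b^S\|_1 \le \frac{56 \lambda^2 s_1}{\phi^2(6, S_1)} + 7 \|f_S - f^0\|_n^2. \]

Case ii. If $I < II$, we get

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta_{S_2 \cup S_3} - (b^S)_{S_2}\|_1 \le 2\, II, \]

and hence

\[ \|\hat f - f^0\|_n^2 + \lambda \|\hat\beta - b^S\|_1 \le \frac{7}{3} II = \frac{28}{3} \lambda \|(b^S)_{S_2}\|_1 + \frac{7}{6} \Big( \frac{\lambda_0}{\lambda^{\alpha}} \Big)^{\frac{2}{1-\alpha}} + 7 \|f_S - f^0\|_n^2. \]

□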
Proof of Lemma 6.1. Let $\|\beta\|_1 = 1$. Then $\|\beta\|_2 \le 1$, and hence
$\|E^T \beta\|_2 \le 1$. For $N \ge V^{-1}(\delta)$ it holds that
$\omega_{N+1} \le \delta$ and hence

\[ \sum_{j=N+1}^p \omega_j^2 (E^T\beta)_j^2 \le \delta^2 \sum_{j=N+1}^p (E^T\beta)_j^2 \le \delta^2. \]

We now note that $\|\beta\|_1 = 1$ implies $\|f_\beta\|_n \le 1$ and hence

\[ \sum_{j=1}^N \omega_j^2 (E^T\beta)_j^2 \le 1. \]

Lemma 14.27 in [3] states that a ball with radius 1 in $N$-dimensional
Euclidean space can be covered by $(3/\delta)^N$ balls with radius $\delta$
(see also Problem 2.1.6 in [15]). □